B&B with Local Heaps#1149
Conversation
Remove dependency on rmm::mr::device_memory_resource base class. Resources now satisfy the cuda::mr::resource concept directly. - Replace shared_ptr<device_memory_resource> with value types and cuda::mr::any_resource<cuda::mr::device_accessible> for type-erased storage - Replace set_current_device_resource(ptr) with set_current_device_resource_ref - Replace set_per_device_resource(id, ptr) with set_per_device_resource_ref - Remove make_owning_wrapper usage - Remove dynamic_cast on memory resources (no common base class) - Remove owning_wrapper.hpp and device_memory_resource.hpp includes - Add missing thrust/iterator/transform_output_iterator.h include (no longer transitively included via CCCL)
…nd deterministic mode. Signed-off-by: Nicolas Guidotti <224634272+nguidotti@users.noreply.github.com>
Signed-off-by: Nicolas Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas Guidotti <nguidotti@nvidia.com>
… shared_ptr to avoid unnecessary copy. Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
…l crash in work-stealing Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
…queue for now. refactoring. Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
… are present Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
# Conflicts: # cpp/src/utilities/cuda_helpers.cuh
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
# Conflicts: # ci/validate_wheel.sh
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
# Conflicts: # cpp/src/branch_and_bound/mip_node.hpp # cpp/src/branch_and_bound/pseudo_costs.cpp
Kh4ster
left a comment
There was a problem hiding this comment.
Very cool results! Thanks @nguidotti !
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
|
/ok to test 207fab3 |
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
# Conflicts: # cpp/src/mip_heuristics/mip_constants.hpp
|
/ok to test cddb787 |
| break; | ||
| } | ||
| } | ||
| lower_bound = node_queue_.best_first_queue_size() > 0 ? node_queue_.get_lower_bound() |
There was a problem hiding this comment.
Is this correct? Above you are pushing nodes into the worker queue.
But you removed this line that gets the lower bound from the queue if there are nodes in the queue.
Maybe this is happening in this line:
lower_bound = std::min(lower_bound, worker->node_queue.get_lower_bound());
But it's difficult to follow. Maybe split it into two loops. The first loop, goes through the nodes in the queue and fathoms them.
The second loops of over the workers to compute the lower bound.
|
|
||
| // We need to temporarily save the lower bound in this worker so it is | ||
| // considered when calculating the global lower bound. | ||
| this->lower_bound = std::min<f_t>(this->lower_bound, other->node_queue.get_lower_bound()); |
There was a problem hiding this comment.
I'm confused as to what is happening here. Again the comment says we are temporarily saving the lower bound of the worker. But then we don't reset the lower bound
chris-maes
left a comment
There was a problem hiding this comment.
Thanks @nguidotti . I took a brief pass at this PR.
My biggest comment is that after this PR, I find it very hard to reason about the lower bound. We've had many bugs where the lower bound is incorrect or lost. I'm concerned that unless we make the handling of the lower bound as clear as possible to follow there will be more bugs. There are so many places in the code that now track the lower bound. For instance, there is a lower bound member in the node_queue_t. There is a lower bound member in bfs_worker_t (really in branch_and_bound_worker_t), there is a local variable named lower_bound in best_first_search_with there is a function called get_lower_bound, there is the lower bound ceiling.
Is there anyway to simplify this? For instance, do we need to have the lower bound stored in the node_queue_t and the worker? Could we only store it one place?
Can we make clear the comments around temporarily setting the lower bound?
…bound` Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
…nsidered in `get_lower_bound` Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Unfortunately, no. Let forget the parallel execution for a moment. When we are doing best-first search with plunges, there are two places where we need to consider the lower bound: the top of the heap ( |
|
For computing the lower bound for the entire solver, you take |
|
There are other things, we also need to consider. For instance, we might encounter a numerical issue in a given node, which blocks the solver to progress further down that path. Similarly, starting a new worker or stealing a node, the node needs to be transferred from one worker to the next (pop, then push the best node in the heap). In the meanwhile, the lower bound of that node needs to be store somewhere so it can be considered when computing the global lower bound. |
|
I tried to concentrate all the global lower bound computation in the |
|
/ok to test 02d0381 |
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
|
/ok to test e212f63 |
aliceb-nv
left a comment
There was a problem hiding this comment.
Excellent work Nicolas :) Mostly some nitpicks.
I've run your PR on Eos and I see material improvements:
================================================================================
THEORETICAL BEST RESULTS (Combining Best from Both Runs)
================================================================================
Baseline Run:
Feasible count: 226
Gap <= 0.1%: 116
Average primal gap: 0.135526
Average dual gap: 0.223231
Optimal (MIP gap < 1e-4): 60
Compared Run:
Feasible count: 229
Gap <= 0.1%: 115
Average primal gap: 0.121167
Average dual gap: 0.213514
Optimal (MIP gap < 1e-4): 66
Theoretical Best (if all regressions fixed):
Feasible count: 229
Gap <= 0.1%: 124
Average gap: 0.111041
Potential Improvement:
Feasible count improvement: 0
Average gap improvement: 0.024485
================================================================================
================================================================================
INSTANCES SOLVED TO OPTIMALITY BY COMPARED RUN BUT NOT BY BASELINE (9 instances)
================================================================================
Instance Baseline MIP Gap Compared MIP Gap
--------------------------------------------------------------------------------
ns1952667 100.0000% 0.0000%
peg-solitaire-a3 100.0000% 0.0000%
neos-5188808-nattai 89.0663% 0.0000%
neos-950242 25.0000% 0.0000%
rail507 0.5747% 0.0000%
neos-4722843-widden 0.3150% 0.0000%
neos-4738912-atrato 0.0100% 0.0099%
gen-ip002 0.0100% 0.0099%
binkar10_1 0.0100% 0.0099%
================================================================================
INSTANCES SOLVED TO OPTIMALITY BY BASELINE BUT NOT BY COMPARED RUN (3 instances)
================================================================================
Instance Baseline MIP Gap Compared MIP Gap
--------------------------------------------------------------------------------
ns1208400 0.0000% 100.0000%
triptim1 0.0000% 1.3885%
neos-933966 0.0000% 0.9346%
| case LINE_SEARCH_DIVING: return settings.line_search_diving != 0; | ||
| case GUIDED_DIVING: return settings.guided_diving != 0 && has_incumbent; | ||
| case COEFFICIENT_DIVING: return settings.coefficient_diving != 0; | ||
| default: return false; |
There was a problem hiding this comment.
We shouldn't add a default clause to most switch statements IMO - this allows the compiler to -Werror at us whenever we add a new enum value and we forget to update some of these switches :)
| void set_inactive() { this->is_active = false; } | ||
|
|
||
| // Steal nodes from another worker | ||
| bool steal_node_from(bfs_worker_t* other, i_t num_nodes) |
There was a problem hiding this comment.
Let's use a reference for 'other' since it doesn't semantically make sense for it to be nullptr
| return steal; | ||
| } | ||
|
|
||
| while (num_nodes > 0) { |
There was a problem hiding this comment.
As a side note - do we (or the OMP runtime) have a way to measure and report the amount of contention on omp locks?
|
|
||
| private: | ||
| std::vector<T> buffer; | ||
| omp_atomic_t<size_t> num_entries_{0}; |
There was a problem hiding this comment.
Why was this added as an atomic?
| void push(mip_node_t<i_t, f_t>* new_node) | ||
| { | ||
| assert(new_node != nullptr); | ||
| auto entry = std::make_shared<heap_entry_t>(new_node); |
There was a problem hiding this comment.
What are the ownership rules regarding new_node? Is a std::shared_ptr required here instead of unique_ptr + ownership moves?
I realize this is existing code, I'm just struggling to remember why we made these choices :)
| /* clang-format on */ | ||
|
|
||
| #pragma once | ||
| #include <array> |
There was a problem hiding this comment.
nit: maybe we could stick to a C array here to avoid having to pull in C++ machinery for such a very common header, to keep build times slim
| worker->node_queue.lock(); | ||
| worker->node_queue.push(node); | ||
| worker->node_queue.unlock(); |
There was a problem hiding this comment.
Maybe we could turn this into a push_atomic() primitive. That'd lessen the risk of accidentally adding a racey push later down the line. If we want to keep the ability to do unlocked pushes, maybe we could make it explicit in the naming, e.g. push_nolock()?
| worker->set_inactive(); | ||
| bfs_worker_pool_.return_worker_to_pool(worker); |
There was a problem hiding this comment.
Does "return_worker_to_pool" always imply "worker" is set inactive? If so, it may be safer to just have "return_worker_to_pool" unconditionally mark the worker as inactive
In this PR, each best-first worker has its own local node heap, such that it push/pop nodes without synchronizing with other workers. Each best-first worker periodically steals a node from a random worker to keep the node distribution more or less balance across them. Additionally, each best-first worker has a (fixed) set of diving worker assigned to it, which are used for performing diving on its own nodes whenever possible. This essentially eliminates the need of the scheduler thread, freeing one additional thread to do something useful.
This also implements a compression scheme for
vstatususing only2bitsper entry, which reduces the memory consumption by roughly4x(previously was usingint8_tper entry). Last, but not least, this PR replacesstd::dequewith a fixed-capacitycircular_deque_tfor the plunge/dive stacks and the idle-worker list.MIPLIB results (GH200, 10min):
In summary, we explored
~3xnodes in average` at the same time frame. The number of optimal solutions also increased by 3.Checklist